Analysis and Modeling of MapReduce’s Performance on Hadoop YARN

Authors

  • Qiuyi Tang
  • Thomas C. Bressoud
Abstract

With the rapid growth of technology, scientists have faced the challenge of efficiently analyzing large data sets since the beginning of the 21st century. Increases in data volume and complexity have shifted scientists’ focus to parallel, distributed algorithms running on clusters. In 2004, Jeffrey Dean and Sanjay Ghemawat of Google introduced MapReduce, a new programming model for processing large data sets [2]. Apache Hadoop, an open-source software framework that uses MapReduce as its data-processing layer, was developed at Yahoo as early as 2006 and evolved into a stable platform by 2011. Although Hadoop is widely used in industry, its performance characteristics are not well understood. Following Hadoop’s workflow, this paper analyzes the factors that influence the running time of each phase of a Hadoop execution. Given those factors, our goal is to model the performance of MapReduce applications.

Similar resources

ABS-YARN: A Formal Framework for Modeling Hadoop YARN Clusters

In cloud computing, software which does not flexibly adapt to deployment decisions either wastes operational resources or requires reengineering, both of which may significantly increase costs. However, this could be avoided by analyzing deployment decisions already during the design phase of the software development. Real-Time ABS is a formal language for executable modeling of deployed virtua...

Survey on Hadoop and Introduction to YARN

Big Data, the analysis of large quantities of data to gain new insight, has become a ubiquitous phrase in recent years. Data is growing at a staggering rate day by day. One of the efficient technologies for dealing with Big Data is Hadoop, which is discussed in this paper. Hadoop uses the MapReduce programming model to process jobs with large data volumes. Hadoop makes use of different sch...

MPJ Express Meets YARN: Towards Java HPC on Hadoop Systems

Many organizations—including academic, research, commercial institutions—have invested heavily in setting up High Performance Computing (HPC) facilities for running computational science applications. On the other hand, the Apache Hadoop software—after emerging in 2005— has become a popular, reliable, and scalable open-source framework for processing large-scale data (Big Data). Realizing the i...

Research of Performance of Distributed Platforms Based on Clustering Algorithm

With the continued development and application of Internet technology, more and more data needs to be processed. Spark is a versatile, high-performance parallel computing framework that can be applied to data mining. This paper is based on the parallelization of platforms’ K-means algorithm, by building a YARN cluster environment and making experiments to...

Hi-WAY: Execution of Scientific Workflows on Hadoop YARN

Scientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today’s data-driven science. However, existing scientific workflow management systems (SWfMSs) are often limited to a single workflow language and lack adequate support for large-scale data analysis. On the other hand, current distributed dataflow systems are based on a...

Publication date: 2015